Rene Perez.
This data visualization project uses the California Housing Prices dataset, which comes from the 1990 California census.
The objective is to apply the tools and principles of data visualization to display and uncover trends, patterns, tendencies, and outliers. Using ggplot2 for R, this report will create:
Data transformation using functions such as filter, select, group_by, and others.
Bar charts, line charts, and others.
Scatter plots, histograms.
Dashboards.
ggplot2 is the plotting library used.
The programming language is R.
RStudio is the integrated development environment.
For spatial visualization, the package is sf.
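As a minimal sketch of the transformation and plotting steps listed above, the following uses dplyr and ggplot2 on a small hypothetical data frame. The rows are made up for illustration and only borrow the dataset's column names; the object names (`housing`, `value_by_proximity`, `bar_plot`, `scatter_plot`) are assumptions, not the project's actual code.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy rows that mimic the dataset's columns (not real census values).
housing <- data.frame(
  ocean_proximity    = c("INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "NEAR OCEAN"),
  median_income      = c(2.5, 3.1, 5.2, 4.8, 4.0),
  median_house_value = c(90000, 110000, 310000, 290000, 250000)
)

# Data transformation: keep two columns, drop missing values,
# then average the house value within each ocean_proximity category.
value_by_proximity <- housing %>%
  select(ocean_proximity, median_house_value) %>%
  filter(!is.na(median_house_value)) %>%
  group_by(ocean_proximity) %>%
  summarise(mean_value = mean(median_house_value), .groups = "drop")

# Bar chart of the grouped means.
bar_plot <- ggplot(value_by_proximity, aes(x = ocean_proximity, y = mean_value)) +
  geom_col() +
  labs(x = "Ocean proximity", y = "Mean of median house value")

# Scatter plot of house value against income.
scatter_plot <- ggplot(housing, aes(x = median_income, y = median_house_value)) +
  geom_point() +
  labs(x = "Median income", y = "Median house value")
```

Calling `print(bar_plot)` or `print(scatter_plot)` in an interactive session draws the chart; with the real dataset, `housing` would be replaced by the imported data.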
Fitting a Linear Regression Model.
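The model fits reasonably well: it explains about 65% of the variance in median_house_value (R² ≈ 0.65).
All coefficients are statistically significant at the 5% level, and all except ocean_proximityNEAR.BAY are significant at the 1% level.
median_income is the strongest positive predictor: its coefficient (39,259.57) is roughly 116 times its standard error.
Location features (longitude, latitude, ocean_proximity) are very important. For example, INLAND districts are valued about $39,300 lower than the reference category, holding the other variables fixed.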
Population and housing structure (total_rooms, households) affect value, but their individual effects may be entangled by multicollinearity [1].
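The fit summarized above can be sketched with base R's `lm()`. This is an illustration only: the data frame below is simulated with a fixed seed, uses just three of the predictors, and its generating coefficients are arbitrary assumptions, so its numbers will not match the reported table.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the housing data (hypothetical values, not census data).
housing <- data.frame(
  median_income      = runif(n, min = 1, max = 10),
  housing_median_age = sample(1:52, n, replace = TRUE),
  ocean_proximity    = factor(sample(c("INLAND", "NEAR BAY", "NEAR OCEAN"),
                                     n, replace = TRUE))
)
housing$median_house_value <- 50000 +
  40000 * housing$median_income +
  1000 * housing$housing_median_age +
  rnorm(n, sd = 30000)

# Ordinary least squares fit; lm() expands the factor into dummy variables,
# using its first level ("INLAND" here) as the reference category.
fit <- lm(median_house_value ~ median_income + housing_median_age + ocean_proximity,
          data = housing)

model_summary <- summary(fit)
r_squared     <- model_summary$r.squared       # share of variance explained
adj_r_squared <- model_summary$adj.r.squared   # adjusted for number of predictors
coef_table    <- coef(model_summary)           # estimate, std. error, t value, p-value
```

With the full dataset, a formula such as `median_house_value ~ .` would include every remaining column as a predictor, which appears to be how the table's thirteen terms arise.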
Dependent variable: median_house_value

| Term | Estimate (Std. Error) |
| --- | --- |
| longitude | -26,812.990*** (1,019.651) |
| latitude | -25,482.190*** (1,004.702) |
| housing_median_age | 1,072.520*** (43.886) |
| total_rooms | -6.193*** (0.791) |
| total_bedrooms | 100.556*** (6.869) |
| population | -37.969*** (1.076) |
| households | 49.617*** (7.451) |
| median_income | 39,259.570*** (338.005) |
| ocean_proximityINLAND | -39,284.300*** (1,744.258) |
| ocean_proximityISLAND | 152,901.900*** (30,741.880) |
| ocean_proximityNEAR.BAY | -3,954.052** (1,913.339) |
| ocean_proximityNEAR.OCEAN | 4,278.134*** (1,569.525) |
| Constant | -2,269,954.000*** (88,013.880) |
| Observations | 20,433 |
| R² | 0.646 |
| Adjusted R² | 0.646 |
| Residual Std. Error | 68,656.950 (df = 20420) |
| F Statistic | 3,111.608*** (df = 12; 20420) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
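To show how the significance stars relate to the reported numbers, here is a short sketch that recomputes t-statistics and two-sided p-values from two estimates and standard errors copied from the table. The variable names are illustrative, and the calculation assumes the usual t-test on the residual degrees of freedom (20420).

```r
# Estimates and standard errors copied from the regression table.
estimate <- c(
  median_income             = 39259.570,
  `ocean_proximityNEAR.BAY` = -3954.052
)
std_error   <- c(338.005, 1913.339)
residual_df <- 20420

# t-statistic: estimate divided by its standard error.
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_stat), df = residual_df, lower.tail = FALSE)
```

The median_income ratio is about 116, far beyond any conventional threshold, while the NEAR.BAY ratio is close to -2.07, which gives a p-value between 0.01 and 0.05 and matches its two-star marking.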
[1] Multicollinearity happens when two or more predictor variables in a regression model are highly correlated with each other. This means they contain overlapping information, which makes it hard for the model to determine which variable is actually influencing the outcome.
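One common way to check for multicollinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below does this by hand in base R on simulated predictors that are deliberately built to be correlated. The data and the `vif` name are assumptions for illustration; the source does not say whether this diagnostic was actually run.

```r
set.seed(42)
n <- 500

# Simulated predictors (hypothetical): rooms and bedrooms are constructed from
# households, so they overlap heavily; median_income is generated independently.
households <- rnorm(n, mean = 500, sd = 150)
predictors <- data.frame(
  households     = households,
  total_rooms    = 5 * households + rnorm(n, sd = 100),
  total_bedrooms = 1.1 * households + rnorm(n, sd = 30),
  median_income  = rnorm(n, mean = 4, sd = 1.5)
)

# VIF for each predictor: regress it on all the others, then 1 / (1 - R^2).
vif <- sapply(names(predictors), function(target) {
  others  <- setdiff(names(predictors), target)
  aux_fit <- lm(reformulate(others, response = target), data = predictors)
  1 / (1 - summary(aux_fit)$r.squared)
})
```

A VIF near 1 means a predictor shares little information with the rest. Values above roughly 5 to 10 are often read as a warning sign, although that cutoff is a rule of thumb rather than a strict test.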